
feat(hesai): add CUDA-accelerated point cloud decoder #421

Open
k1832 wants to merge 5 commits into tier4:main from k1832:feat/core-cuda-decode

@k1832 k1832 commented Mar 19, 2026

PR Type

  • New Feature

Related Links

Description

Add a GPU-accelerated decode path for Hesai LiDAR sensors using CUDA. The feature is:

  • Compile-time opt-in: Build with -DBUILD_CUDA=ON. When the CUDA toolkit is not found, the build silently falls back to CPU-only.
  • Runtime opt-in: Set the NEBULA_USE_CUDA=1 environment variable. When it is unset, the existing CPU path is used with zero overhead.
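The two opt-in levels can be pictured with a minimal sketch. It assumes a compile definition named NEBULA_CUDA_ENABLED (a hypothetical name, not taken from this PR) and treats only the exact value "1" as enabled. The helper names are illustrative as well.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Returns true when the environment variable `name` is set to exactly "1".
// Treating only "1" as "enabled" is an assumption of this sketch.
inline bool env_flag_enabled(const char * name)
{
  assert(name != nullptr);
  const char * value = std::getenv(name);
  return value != nullptr && std::string(value) == "1";
}

// NEBULA_CUDA_ENABLED is a hypothetical compile definition that a build
// configured with -DBUILD_CUDA=ON (and a detected toolkit) would add.
inline bool use_cuda_decode_path()
{
#ifdef NEBULA_CUDA_ENABLED
  return env_flag_enabled("NEBULA_USE_CUDA");
#else
  return false;  // CPU-only build: the variable is never consulted.
#endif
}
```

In a CPU-only build the function collapses to a constant false, which is one way to get the "zero overhead" behaviour described above.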

What it does

  • Processes an entire scan in a single batched CUDA kernel launch (launch_decode_hesai_scan_batch)
  • Uses pre-computed angle lookup tables (azimuth/elevation) uploaded to GPU once at initialization
  • Supports calibration-based and correction-based angle correctors
  • Currently validated on the OT128 (Pandar128E4X) sensor
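As a rough CPU-side illustration of the lookup-table idea, the sketch below precomputes sine and cosine values once and then converts a whole batch of returns in one pass. Each loop iteration stands in for the work a single GPU thread might do. The types, the function names, the uniform angle spacing, and the axis convention (x = r·cos(el)·sin(az), y = r·cos(el)·cos(az)) are all assumptions for illustration. The real kernel and its per-channel corrections may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical table of precomputed trigonometric values.
struct AngleLut
{
  std::vector<float> sin_values;
  std::vector<float> cos_values;
};

// Builds a table for n_steps equally spaced angles over one full turn.
AngleLut build_angle_lut(std::size_t n_steps)
{
  assert(n_steps > 0);
  const double two_pi = 2.0 * std::acos(-1.0);
  AngleLut lut;
  lut.sin_values.reserve(n_steps);
  lut.cos_values.reserve(n_steps);
  for (std::size_t i = 0; i < n_steps; ++i) {
    const double angle = two_pi * static_cast<double>(i) / static_cast<double>(n_steps);
    lut.sin_values.push_back(static_cast<float>(std::sin(angle)));
    lut.cos_values.push_back(static_cast<float>(std::cos(angle)));
  }
  return lut;
}

struct RawReturn
{
  float distance_m;
  std::uint32_t azimuth_idx;
  std::uint32_t elevation_idx;
};

struct DecodedPoint
{
  float x;
  float y;
  float z;
};

// Converts every return of a scan in one pass, using table lookups only.
std::vector<DecodedPoint> decode_scan_batch(
  const std::vector<RawReturn> & returns, const AngleLut & azimuth, const AngleLut & elevation)
{
  std::vector<DecodedPoint> points;
  points.reserve(returns.size());
  for (const RawReturn & r : returns) {
    assert(r.azimuth_idx < azimuth.sin_values.size());
    assert(r.elevation_idx < elevation.sin_values.size());
    const float xy = r.distance_m * elevation.cos_values[r.elevation_idx];
    points.push_back(
      {xy * azimuth.sin_values[r.azimuth_idx], xy * azimuth.cos_values[r.azimuth_idx],
       r.distance_m * elevation.sin_values[r.elevation_idx]});
  }
  return points;
}
```

Because the tables depend only on the sensor calibration, they can be built and uploaded once, and the per-point work reduces to a few multiplications.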

Files changed

| File | Change |
| --- | --- |
| hesai_cuda_kernels.cu | New CUDA kernel for batched point cloud decoding |
| hesai_cuda_decoder.hpp | GPU buffer management, angle LUT, device memory |
| hesai_decoder.hpp | Integration: GPU scan buffer, flush, result conversion |
| hesai_sensor.hpp | Expose max_scan_buffer_points() for GPU buffer sizing |
| angle_corrector_*.hpp | Expose angle LUT data for GPU upload |
| nebula_hesai_decoders/CMakeLists.txt | CUDA library target, toolkit detection |
| nebula_hesai/CMakeLists.txt | CUDA decoder test target |
| hesai_cuda_decoder_test.cpp | 5 GPU-vs-CPU equivalence tests |

Known limitations

  • The GPU kernel does not set the return_type field (it is always 0)
  • Scan boundary detection differs from the CPU's ScanCutter, so up to roughly 1850 points (out of ~72k per scan) can shift between adjacent scans

Review Procedure

Build (with CUDA)

colcon build --packages-up-to nebula_hesai \
  --cmake-args -DBUILD_CUDA=ON -DBUILD_TESTING=ON

Requires the NVIDIA CUDA Toolkit (tested with CUDA 12.x). If the toolkit is not found, the build succeeds but CUDA support is silently disabled.

Running with CUDA enabled

The GPU decode path is gated by a runtime environment variable:

# Enable GPU decoding
export NEBULA_USE_CUDA=1

# Launch the driver node as usual — it will log "GPU scan batching enabled" on startup
ros2 launch nebula_hesai ...

# To disable (default), unset the variable
unset NEBULA_USE_CUDA

Test

# Run all tests (132 existing + 5 new CUDA tests)
source install/setup.bash
colcon test --packages-select nebula_hesai --ctest-args -V

# Or run CUDA tests only
./build/nebula_hesai/hesai_cuda_decoder_test_main

Test results

[==========] Running 5 tests from 1 test suite.
[ RUN      ] HesaiCudaDecoderTest.OT128_GpuVsCpuEquivalence
[       OK ] HesaiCudaDecoderTest.OT128_GpuVsCpuEquivalence (21778 ms)
[ RUN      ] HesaiCudaDecoderTest.OT128_GpuOutputNonEmpty
[       OK ] HesaiCudaDecoderTest.OT128_GpuOutputNonEmpty (388 ms)
[ RUN      ] HesaiCudaDecoderTest.OT128_GpuFieldValidity
[       OK ] HesaiCudaDecoderTest.OT128_GpuFieldValidity (378 ms)
[ RUN      ] HesaiCudaDecoderTest.OT128_BoundaryScanPointCounts
[       OK ] HesaiCudaDecoderTest.OT128_BoundaryScanPointCounts (369 ms)
[ RUN      ] HesaiCudaDecoderTest.OT128_IntensityExactMatch
[       OK ] HesaiCudaDecoderTest.OT128_IntensityExactMatch (17217 ms)
[  PASSED  ] 5 tests.

# Full suite
Summary: 137 tests, 0 errors, 0 failures, 0 skipped

Remarks

  • When CUDA is not compiled in (BUILD_CUDA=OFF), the 5 CUDA tests are compiled but skip at runtime via GTEST_SKIP(), so they do not break CPU-only CI.
  • Tolerances in the equivalence tests were derived from a single OT128 rosbag. See the test file header for the observed values.
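A tolerance-based comparison of the kind these tests rely on could look like the sketch below. It is not the code from hesai_cuda_decoder_test.cpp. The Point layout, the function names, and the index-aligned comparison are assumptions. Positions are compared within a tolerance, intensities must match exactly, and point counts may differ by a bounded amount.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical point layout, reduced to the fields compared here.
struct Point
{
  float x;
  float y;
  float z;
  std::uint8_t intensity;
};

// True when the two point counts differ by at most max_diff.
bool counts_within(std::size_t cpu_count, std::size_t gpu_count, std::size_t max_diff)
{
  const std::size_t diff =
    cpu_count > gpu_count ? cpu_count - gpu_count : gpu_count - cpu_count;
  return diff <= max_diff;
}

// True when both clouds have the same size, every pair of points at the same
// index agrees within position_tolerance_m on each axis, and intensities are
// identical.
bool points_equivalent(
  const std::vector<Point> & cpu, const std::vector<Point> & gpu, float position_tolerance_m)
{
  assert(position_tolerance_m >= 0.0f);
  if (cpu.size() != gpu.size()) {
    return false;
  }
  for (std::size_t i = 0; i < cpu.size(); ++i) {
    if (cpu[i].intensity != gpu[i].intensity) {
      return false;
    }
    if (
      std::fabs(cpu[i].x - gpu[i].x) > position_tolerance_m ||
      std::fabs(cpu[i].y - gpu[i].y) > position_tolerance_m ||
      std::fabs(cpu[i].z - gpu[i].z) > position_tolerance_m) {
      return false;
    }
  }
  return true;
}
```

With the boundary shift noted under Known limitations, a count check such as counts_within(cpu.size(), gpu.size(), 1850) would be the natural bound for scans at a boundary. The position tolerance itself has to come from the values observed on the rosbag.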

Pre-Review Checklist for the PR Author

PR Author should check the checkboxes below when creating the PR.

  • Assign PR to reviewer

Checklist for the PR Reviewer

Reviewers should check the checkboxes below before approval.

  • Commits are properly organized and messages are according to the guideline
  • (Optional) Unit tests have been written for new behavior
  • PR title describes the changes

Post-Review Checklist for the PR Author

PR Author should check the checkboxes below before merging.

  • All open points are addressed and tracked via issues or tickets

CI Checks

  • Build and test for PR: Required to pass before the merge.

@k1832 k1832 force-pushed the feat/core-cuda-decode branch from 580316f to cd2b0e8 Compare March 23, 2026 01:32
k1832 added 2 commits March 23, 2026 12:21
Add a GPU decode path for Hesai LiDAR sensors, gated behind compile-time
BUILD_CUDA=ON and runtime NEBULA_USE_CUDA=1 environment variable.

The implementation includes:
- CUDA kernel for batched point cloud decoding (hesai_cuda_kernels.cu)
- Angle LUT upload and GPU scan buffer management in hesai_decoder.hpp
- GPU-vs-CPU equivalence tests for OT128 (Pandar128E4X) sensor

The GPU path processes an entire scan in a single kernel launch, using
pre-computed angle lookup tables and a sparse output buffer. When CUDA
is not available or NEBULA_USE_CUDA is unset, the existing CPU path is
used with zero overhead.

Signed-off-by: Keita Morisaki <kmta1236@gmail.com>

- Copyright year 2024 -> 2026 for new files
- Replace deprecated find_package(CUDA) with find_package(CUDAToolkit)
- Remove --expt-relaxed-constexpr flag (not needed)
- Remove unused per-packet kernel and launcher (dead code)
- Batch launcher returns bool; caller logs via NEBULA_LOG_STREAM
- Reorder CudaNebulaPoint fields for better memory packing
- Remove redundant is_multi_frame member; use n_frames > 1
- Make HesaiCudaDecoder destructor virtual
- Add int32_t range guarantee comment in angle corrector

Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
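To illustrate why reordering fields can improve memory packing, here is a small sketch with two hypothetical layouts. The field names and types are assumptions and do not reproduce the actual CudaNebulaPoint. Placing the widest members first and grouping the narrow ones removes the padding the compiler would otherwise insert before each float.

```cpp
#include <cstddef>
#include <cstdint>

// Narrow and wide members interleaved: padding is inserted before each float.
struct PointInterleaved
{
  std::uint8_t return_type;
  float x;
  std::uint8_t intensity;
  float y;
  std::uint16_t channel;
  float z;
};

// Widest members first, narrow members grouped at the end.
struct PointPacked
{
  float x;
  float y;
  float z;
  std::uint16_t channel;
  std::uint8_t intensity;
  std::uint8_t return_type;
};

static_assert(
  sizeof(PointPacked) < sizeof(PointInterleaved),
  "grouping members by width should shrink the struct");

// Bytes saved per point by the reordered layout on the current platform.
constexpr std::size_t bytes_saved_per_point()
{
  return sizeof(PointInterleaved) - sizeof(PointPacked);
}
```

On common 64-bit ABIs with 4-byte float alignment, the interleaved layout occupies 24 bytes and the packed one 16. Across a buffer sized for a full scan, that difference reduces both device memory use and transfer volume.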
@k1832 k1832 force-pushed the feat/core-cuda-decode branch from cd2b0e8 to 508175b Compare March 23, 2026 03:21

codecov bot commented Mar 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 48.36%. Comparing base (baf4f92) to head (18fb65c).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #421      +/-   ##
==========================================
+ Coverage   48.34%   48.36%   +0.02%     
==========================================
  Files         156      157       +1     
  Lines       12996    13004       +8     
  Branches     6900     6903       +3     
==========================================
+ Hits         6283     6290       +7     
- Misses       5326     5327       +1     
  Partials     1387     1387              
| Flag | Coverage Δ |
| --- | --- |
| nebula_hesai | 32.69% <100.00%> (?) |
| nebula_hesai_decoders | 32.69% <100.00%> (?) |


Replace .points access with direct iteration over PointCloud<T>
(which now extends std::vector<T> instead of pcl::PointCloud).

Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
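The change described in this commit can be sketched as follows. PointCloud<T> here is a stand-in written for illustration, not Nebula's real definition. The point is that a container deriving from std::vector<T> is iterated directly instead of through a .points member.

```cpp
#include <cassert>
#include <vector>

// Illustrative stand-in: a point cloud that is itself a std::vector<T>.
template <typename T>
struct PointCloud : public std::vector<T>
{
  using std::vector<T>::vector;  // inherit std::vector's constructors
};

struct Point
{
  float x;
  float y;
  float z;
};

// Returns the largest x coordinate in a non-empty cloud.
float max_x(const PointCloud<Point> & cloud)
{
  assert(!cloud.empty());
  float best = cloud.front().x;
  // With a pcl-style cloud this loop would read: for (const auto & p : cloud.points)
  for (const Point & p : cloud) {
    if (p.x > best) {
      best = p.x;
    }
  }
  return best;
}
```

Call sites that previously used cloud.points.size() or cloud.points[i] would likewise become cloud.size() and cloud[i].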
@k1832 k1832 force-pushed the feat/core-cuda-decode branch from 62ab94c to 09658fe Compare March 23, 2026 03:55
pre-commit-ci bot and others added 2 commits March 23, 2026 03:56
- Add missing #include <string> in hesai_decoder.hpp
- Add missing #include <limits> in hesai_cuda_decoder_test.cpp
- Fix readability/braces warning for ifdef-guarded else block

Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
@k1832 k1832 marked this pull request as ready for review March 23, 2026 04:05